1 

Covariance estimation in decomposable 
Gaussian graphical models 

Ami Wiesel, Yonina C. Eldar and Alfred O. Hero III 

Abstract 

Graphical models are a framework for representing and exploiting prior conditional independence structures 
within distributions using graphs. In the Gaussian case, these models are directly related to the sparsity of the 
inverse covariance (concentration) matrix and allow for improved covariance estimation with lower computational 
complexity. We consider concentration estimation with the mean-squared error (MSB) as the objective, in a special 
type of model known as decomposable. This model includes, for example, the well known banded structure and other 
cases encountered in practice. Our first contribution is the derivation and analysis of the minimum variance unbiased 
estimator (MVUE) in decomposable graphical models. We provide a simple closed form solution to the MVUE and 
compare it with the classical maximum likelihood estimator (MLE) in terms of performance and complexity. Next, 
we extend the celebrated Stein's unbiased risk estimate (SURE) to graphical models. Using SURE, we prove that the 
MSE of the MVUE is always smaller or equal to that of the biased MLE, and that the MVUE itself is dominated by 
other approaches. In addition, we propose the use of SURE as a constructive mechanism for deriving new covariance 
estimators. Similarly to the classical MLE, all of our proposed estimators have simple closed form solutions but result 
in a significant reduction in MSE. 

I. Introduction 

Covariance estimation in Gaussian distributions is a classical and fundamental problem in statistical signal 
processing. Many applications, varying from array processing to functional genomics, rely on accurately estimated 
covariance matrices [1], [2]. Recent interest in inference in high dimensional settings using small sample sizes has 
caused the topic to rise to prominence once again. A natural approach in these settings is to incorporate additional 
prior knowledge in the form of structure and/or sparsity in order to ensure stable estimation. Gaussian graphical 
models provide a method of representing conditional independence structure among the different variables using 
graphs. An important property of the Gaussian distributions is that conditional independence among groups of 
variables is associated with sparsity in the inverse covariance. Due to the sparsity, these models allow for efficient 
implementation of statistical inference algorithms, e.g., the iterative proportional scaling technique [3], [4]. 
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Over the last years, statistical graphical models were successfully applied to speech recognition [5], [6], image 
processing [7], [8], sensor networks [9], computer networks [10] and other fields in signal processing. Efficient 
Bayesian inference in Gaussian graphical models is well estabUshed [11], [12], [13]. Estimation of deterministic 
parameters received less attention, but have also been considered implicitly, e.g., the recent works on inverse 
covariance structure in the context of state of the art array processing [14], [15]. 

Estimation of deterministic parameters in Gaussian graphical models is basically covariance estimation since 
the Gaussian distribution is completely parameterized by second order statistics. The most common approach to 
covariance estimation is maximum likelihood. When no prior information is available, this method yields the sample 
covariance matrix. It is asymptotically unbiased and efficient but does not minimize the mean-squared error (MSE) 
in general. Indeed, depending on the performance measure, better estimators can be obtained through regularization, 
shrinkage, empirical Bayes and other methods [16], [17], [18], [19], [20], [21], [22]. 

Covariance estimation in Gaussian graphical models involves the estimation of the unknown covariance based 
on the observed realizations and prior knowledge of the conditional independence structure within the distribution 
[4], [3], [23], [24]. The prior information allows for better performance with lower computational complexity. 
Decomposable graphical models, also known as chordal or triangulated, satisfy a special structure which leads to a 
simple closed form solution to the maximum likelihood estimate (MLE) as well as elegant analysis. These models 
include many practical signal processing structures such as the banded concentration matrix and its variants [14], 
[25], [21], [22] as well as multiscale structures [7], [8]. 

Covariance selection is a related topic which addresses the joint problem of covariance estimation and graphical 
model selection. This setting is suitable to many modern applications in which the conditional independence structure 
is unknown and must be learned from the observations. Numerous selection methods have been recently considered 
for both arbitrary graphical models [26], [27], [28], [29] and decomposable models[30], [31]. Clearly, these methods 
are intertwined with covariance estimation. For example, the latter is the core of the intermediate steps in many of 
the advanced greedy stepwise selection methods. 

In this paper, we consider inverse covariance estimation in decomposable Gaussian graphical models with the 
mean-squared error (MSE) as our objective. Except for the prior conditional independence structure, we do not 
assume any prior knowledge on the covariance and treat it as an unknown deterministic parameter. Our main 
contribution is the derivation of the minimum variance unbiased estimator (MVUE) of the inverse covariance. 
Similarly to the MLE, the MVUE has a simple closed form solution which can be efficiently implemented in a 
distributed manner. Moreover, it minimizes the MSE among all unbiased estimators. We also prove that it has 
smaller MSE than the biasd MLE. The proof is based on an extension of the celebrated Stein unbiased risk estimate 
(SURE) [16], [32], [33] to Gaussian graphical models. Using SURE we prove that the MVUE dominates the MLE 
in terms of MSE, i.e., its MSE is always smaller or equal to that of the MLE. In addition, we prove that the MVUE 
itself is dominated by other biased estimators. Next, we propose the use of SURE as a method for hyper-parameter 
tuning in existing covariance estimation approaches, e.g., the conjugate prior based methods proposed in [34], [35]. 

The outline of the paper is as follows. We begin in Section II where we formally define decomposable graphical 
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models, provide a few illustrative applications, and formulate the estimation problem. In Section III, we review 
the classical MLE approach and derive the finite sample MVUE. Next, in Section IV we consider SURE and 
its applications. While our estimators have lower MSE, they require more samples in order to ensure positive 
semidefiniteness. This issue is addressed in Section V. We evaluate the performance of the different estimators 
using numerical simulations in Section VI, and offer concluding remarks in Section VII. 

The following notation is used. Boldface upper case letters denote matrices, boldface lower case letters denote 
column vectors, and standard lower case letters denote scalars. The superscripts (•)^ and (•)~^ denote the transpose 
and matrix inverse, respectively. For sets a and b, the cardinality is denoted by \a\ and the set difference operator 
is denoted by a \ &. The operator ||X|| denotes the Frobenius norm of a matrix X, namely ||X|p = Tr(X^X), 
and X :^ means that X is positive definite. The zero fill-in operator [ J^ outputs a conformable matrix where the 
argument occupies its appropriate sub-block and the rest of the matrix has zero valued elements (See [3J for the 
exact definition of this operator). 

II. COVARIANCE ESTIMATION IN GRAPHICAL MODELS 

In this section, we provide an introduction to decomposable Gaussian graphical models based on [3] along with 
a few motivating apphcations for their use in modem statistical signal processing. Then, we formulate the inverse 
covariance estimation problem addressed in this paper. 

A. Decomposable Gaussian graphical models 

Graphical models are intuitive characterizations of conditional independence structures within distributions. 
An undirected graph Q = (V, £) is a set of nodes V = connected by undirected edges E = 

{(*ii.7i) I ■ ■ ■ {ilE], j\E\)}, where we use the convention that each node is connected to itself, i.e., {i, i) £ E for all 
i E V. Let X be a zero mean random vector of length p = \V\ whose elements are indexed by the nodes in V. The 
vector X satisfies the Markov property with respect to Q, if for any pair of non-adjacent nodes the corresponding 
pair of elements in x are conditionally independent of the remaining elements, i.e., [xj^ and [x]^ are conditionally 
independent of [x]y^-^ for any {i.j} ^ E. For the multivariate Gaussian distribution, conditional indpendence is 
equivalent to conditional uncorrelation: 

E { ([x]. - E { [x].| [x]^\. .}) ([x] . - £ {[x] . |[x]^\. . }) I [x]^^. .} = 0, {i,j} ^ E. (1) 

Simple algebraic manipulations (see Appendix I) show that (1) is equivalent to 

[K].^. = for all {i.3}iE. (2) 

where K is the concentration (inverse covariance) matrix of x defined as 

VL={E{^^])~\ (3) 

To summarize, Gaussian graphical models are directly related to sparsity in the concentration matrix K. Therefore, 
throughout this paper we focus on K and parameterize the Gaussian distribution using the concentration matrix 
rather than the covariance matrix. 
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Decomposable models are a special case of graphical models in which the conditional independence graphs 
satisfy an appeahng structure. A graph (or subgraph) is complete if all of its nodes are connected by an edge. A 
subset c CV separates a CV and b CV ii all paths from a to 6 intersect c. A triple {a, c, b} of disjoint subsets 
of 5 is a decomposition oi G if V = aU cL) b, c separates a and b, and c is complete. Finally, a graph Q is 
decomposable if it is complete, or if it can be recursively decomposed into two decomposable subgraphs Qauc and 

GcUb- 

It is convenient to represent a decomposable graph using cliques. A clique is a maximal complete subset of 
nodes. If ^ is decomposable then there exists a sequence of cliques Ci , • • ■ , Ck which satisfy a perfect elimination 
order. This order implies that {Hj-i\Sj, Sj, Rj} is a decomposition of the subgraph Qh^ where 

Hj = Ci U Ca U • • • U Cj, j = l,--- ,K (4) 

Sj = Hj_inCj, j = 2,---,K (5) 

= 2,---,K. (6) 



i-i' 



Cfe and \Sk \ = Sk, so that 



For later use, we denote the cardinalities of the cliques and separators by \Ck 

K K 

^Ck-^Sk=p. (7) 

fe=l fe=2 

Consequently, if a Gaussian multivariate x satisfies a decomposable graphical model over the graph Q, then its 
concentration matrix K belongs to the set of decomposable sparse positive semidefinite matrices: 

/C = {K:K^0, [K]^,_,\s,,ii, =0> i = 2,...,if}. (8) 

Decomposable graphical models appear in many signal processing appUcations. We now review a few represen- 
tative examples: 

• Diagonal or block diagonal: A trivial graphical model is the diagonal or block diagonal model, in which the 
cliques are non-overlapping. For example, the following matrix has two cUques Ci = {1, 2} and C2 = {3, 4, 5}: 

□ □ 



K 



□ □ 



(9) 



□ □ □ 

□ □ □ 

□ □ □ 

Two coupled blocks: The simplest non-trivial decomposable graphical model is the two coupled blocks. For 
example, the following matrix has two cUques Ci = {1, 2, 3} and C2 = {3, 4, 5} coupled through ^2 = {3}: 

□ □ □ 

□ □ □ 

□ □ □ □ □ 

□ □ □ 

□ □ □ 




K = 



(10) 
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Note that the definition of decomposable model is that they can be recursively subdivided into this main 
building block. 

Banding: Another frequently used decomposable graphical model is the i'th order banded structure in which 
only the L + 1 principal diagonals of K have non-zero elements. For example, the following matrix is banded 
with L = 2: 



K 



□ 


□ 








□ 


□ 


□ 








□ 


□ 


□ 








□ 


□ 


□ 








□ 


□ 



(11) 



and has for cliques Ci = {1,2}, C2 = {2.3}, C3 = {3,4} and C4 = {4,5}. It is appropriate whenever the 
indices of the multivariate represent physical quantities such as time or space, and the underlying assumption 
is that distant variables are conditionally independent of closer variables. A special case of this structure is 
the autoregressive (AR) model. The AR model is stationary and leads to a banded Toeplitz matrix. The more 
general banded graphical model corresponds to a non-stationary autoregressive process. It was recently shown 
that this structure is a good model for state-of-the-art radar systems [14] (see also [25]). A natural extension of 
the L'th banded model is differential banding in which multiple band lengths are utilized. It is straightforward 
to show that the corresponding graph is still decomposable with cliques of different cardinalities. This form 
was empirically validated to be a reasonable model in call center management in operations research [35]. 
Arrow (Star): Another common decomposable graphical models takes the form of an arrow motif in the 
concentration matrix. This structure is appropriate when there is a single common global sub-block and 
numerous local sub-blocks which are conditionally independent given the global variables. For example, the 
following concentration matrix specifies an arrow graphical model 

XJ) □ □ □ □ □ 

□ □ 

K= □ □ , (12) 

□ □ 

□ □ 

with cliques Ci = (1, 2}, C2 = (1, 3}, C3 = (1, 4} and C4 = (1, 5}. A typical signal processing application 
is a wireless network in which the global node is the access point and the local nodes are the terminals. Other 
applications of these models are discussed in [36]. 
• Multiscale: A connmon graphical model in image processing is based on the multiscale (multiresolution) 
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framework. Here, the decomposable graph is a tree of nodes (or cliques) [7], [8]: 

□ □ □ 

□ □ □ □ 

□ □ □ □ 
K= □ □ . (13) 

□ □ 

□ □ 

□ □ 

B. Problem formulation 

We are now ready to state the problem addressed in this paper. Let x be a length-p zero mean Gaussian random 
vector, with concentration matrix K G /C as in (8). Given n independent realizations of x denoted by {x[7]}f^]^, 
and knowledge of the conditional independence structure, our goal is to derive an estimate K of K with minimum 
MSE, where the MSE is defined as 

2~ 



MSE (K) 



E 



K-K 



(14) 



Here the norm is the matrix Frobenius norm. The MSE in (14) is a function of the unknown parameter K and 
cannot be minimized directly. This dependency is the main difficulty in minimum MSE estimation of deterministic 
parameters, in contrast to the Bayesian framework in which the MSE is a function of the distribution of K but not 
of K itself. More details on this issue can be found in [37]. 

Due to the difficulty of minimum MSE estimation, it is customary to restrict attention to unbiased estimators. 
For this purpose, the MSE is decomposed into its squared bias and variance components defined as 



where 



MSE (K) 

BIAS^ (K) 
VAR(K) 



BIAS^ (K) + VAR (K) 



(15) 



E 



K 



E 



K-E 



{■^}| 



(16) 



We call K an unbiased estimator if BIAS (K) = 0. Although the variance may also depend on K, asymptotically 
in many cases an unbiased estimate exists that minimizes the MSE. 

Our choice of MSE of K as a performance measure requires further elaboration. There are numerous competing 
metrics which could have been adopted: MSE of K~^; weighted norms; KuUbuck-Leibler based distances and 
others. Each of these measures wiU lead to different estimators. We focus on the MSE of the inverse covariance 
due to the following reasons. Graphical models specify the structure of the concentration matrix rather than the 
structure of the covariance matrix so that the ocnetnraiton seems to be a more intuitive paramter. Furthermore, the 
concentration is the natural parameter of the Gaussian distribution as an exponential family [3]. It is parameterized 
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by the free variables associated with the cUques, and has zero values elsewhere. In contrast, the covariance matrix 
is not a natural parameter and has unnecessary and functionally dependent variables outside the cliques. When 
included in the performance measure, these variables mash the behavior of the free variables within the cliques. 
Some of the results in this paper, such as the SURE identity, can also be applied to other performance measures. 

III. Maximum likelihood and minimum variance unbiased estimation 

In this section, we review the classical MLE approach to inverse covariance estimation in decomposable Gaussian 
graphical models and then derive the MVUE estimator. 

A. Maximum likelihood estimation 

We begin with a short review of the sufficient statistics and MLE approach. For a more detailed treatment the 
reader is referred to [3]. 

When no prior information is available and the model consists of one clique Ci = {1, • • • the model is said 
to be saturated. In this case, a minimal sufficient statistic for estimating K is the sample covariance matrix 

n 

S = ^x[i]x^[i]. (17) 

i=l 

Its distribution is Wishart with n degrees of freedom and natural parameter K 

p{S;K) = Wp(S;n,K) 

ISl'^IKI^ iTr/TCSl 



(f ) 

where Fp (•) is the multivariate Gamma function. In graphical models, the known conditional independence structure 
specifies some of the entries of K and the complete sample covariance is no longer necessary. A minimal sufficient 
statistic is the incomplete matrix S defined as 

[S]^. = \ L J^,, ^'JS ^^^^ 

I ? else 

where ? denotes unspecified elements. In particular, if Q is decomposable then only the sub-blocks associated with 
cUques and separators 

Sfe = [S]c,,c,^ k = l,---,K 

S[fe] = [S]s^^s,^ k = 2,---,K (20) 
are necessary. Their marginal distributions are also Wishart 
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Their joint distribution, denoted by p (S), is derived by marginalization of the distribution of S over the unnecessary 
elements, and is known as the hyper Wishart distribution [24] 

p(S;K) = ) — ^. (22) 

The simple product form of the distribution is a direct consequence of the decomposable structure. 

Gaussian graphical models are regular models in the exponential family. By appropriate definitions that transform 
the variables from matrices to vectors and take into account the symmetry, both the Wishart and the hyper Wishart 
distribution can be written as natural exponential distributions. In Appendix II, we provide a few known results on 
this family that will be used in the sequel. 

The MLE of K is defined as' 

Kmle = argmaxlogp (S; K) , (23) 
and has the following closed form solution [3] 

^ ^ r 10 

(24) 

k=l fe=2 

It exists with probability one if and only if n > max^ c/j, in which case it is positive semidefinite. It is locally 
consistent in the sense that the local and global versions of the cliques agree with each other: 



■•^MLE 



= -Sfe, k = l,---,K. (25) 



Both of these properties suggest that the MLE in a decomposable model performs as if the model was block 
diagonal with non-overlapping cliques Cfe. 

In general, the MLE is a biased estimator and does not minimize the MSE. One of the main motivations for 
the MLE is that asymptotically in n it is an MVUE. Therefore, we now address the finite sample MVUE in 
decomposable graphical models. Interestingly, we will show that the MVUE does not behave as if the model was 
block diagonal and improves performance by taking into account the coupling between the cliques. We will also 
see that it dominates the MLE, namely its MSE is smaller for all possible values of K. 

B. Minimum variance unbiased estimation 

For finite sample size, the MVUE is provided in the following theorem. 
Theorem 1: The estimator 

K K 



^MVUE = ^[{n-Ck-l)S^^]° -^\(n- Sk-l)S^^ 



(26) 



A)=l k=2 

exists with probability one if n > max^ c^, and is the minimum variance unbiased estimator (MVUE) of K e /C 
given the incomplete sample covariance matrix S. 

'An alternative but equivalent definition is Kmle = argmaxKgK: logP (S; K). 
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The proof is available in Appendix II and is based on the general MVUE for exponential family distributions. 

Theorem 1 specifies the MVUE of K in decomposable graphical models. The estimator is similar in structure 
and complexity to the MLE in (24) and it is easy to see that asymptotically in n the two estimators are equivalent. 
In the saturated case the MVUE is a scaled version of the MLE. In many signal processing appUcations (e.g., 
principal component analysis) the overall performance is indifferent to a change in scaling of the covariance. In 
decomposable graphical models, the MVUE is not a simple re-scaling of the MLE and there may be practical 
performance gains to its use with almost no additional cost in computational complexity. 

Recall that the MLE requires only n > maxfc Ck samples in order for it to exist and be positive definiteness. 
This is not true for the MVUE which may require more samples to ensure positive semidefiniteness. For example, 
consider a simple K = 2 chques model. Using the matrix inversion formula for partitioned matrices [38, page 572], 
it can be verified that 

1 



^MVUE 



\S]a d . (27) 

A necessary condition for positive definiteness with probabiUty one of Kmvue is [K^vue] ■52,^2 which is 
equivalent to n > p + 1. Thus, although n > maxfe Cfe suffices for existence of an estimator it is meaningless unless 
n> p+1. Identity (27) may suggest that the MVUE is locally consistent but it can be verified that this is not true, 
i.e.. 



^MVUE 



^ ^ ^ ^—r^k (28) 



for k = 1,2. Evidently, in contrast to the MLE, the MVUE does not behave as if the model were block diagonal 
and it accounts for the couphng between the chques. 

The MVUE minimizes the MSE over the class of unbiased estimators. This is an important property but it does 
not ensure optimality over all estimators, biased or unbiased. In the next section, we prove that the MVUE actually 
dominates the biased MLE in terms of MSE performance. 

IV. Stein's unbiased risk estimate (SURE) 

Stein's unbiased risk estimate (SURE) provides an unbiased approximation of the MSE. The SURE approach 
was originally applied to the estimation of a Gaussian mean parameter [16]. It was generalized to the Wishart 
distribution in [16], [17], and later extended to the natural parameters of any exponential family distribution in [32], 
[33]. The following theorem extends these results to Gaussian graphical models. 

Theorem 2: Lei S be an incomplete sample covariance matrix associated with a graphical model Q = {V, E} as 
defined in (19), and let H (S) be a continuously differentiable matrix function of S. Then, 

E {Tr {H (S) K}} = £ {tV {h (S) Kmvue + 2VH (S)} } (29) 

where Kmvue is the MVUE of K given S and the differential operator is a px p matrix with elements 



hai- i^j,{iJ}^E (30) 
else. 
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The proof is available in Appendix II and is based on the general SURE approach to estimation in exponential 
family distributions. The theorem holds for any Gaussian graphical model but is useful only when the MVUE 
can be evaluated. In the decomposable case, the MVUE is provided in (26) and the differential operator has the 
appeahng form: 



K 



K 



(31) 



fe=i 



fe=2 



where Vfe = [V]^^ and V[fc] = [V]g^ are the Cfe x Cfe and Sk x Sk differential operators within the saturated 
cUques and the saturated separators, respectively. 

In the following subsections, we apply SURE to the derivation and MSE analysis of several estimators. 

A. MVUE dominates MLE 

Our first application of Theorem 2 is to prove that the MLE is inadmissible and dominated by the MVUE: 
Theorem 3: The MVUE in (26) dominates the MLE in (24) in terms of MSE: 



E- 



K 



MVUE 



-K 



Kmle — K 



for all K in the set K, defined in (8). 
Proof: The difference in MSEs is 

5 = E 



Kmvue — K 

2 



E 



K 



E{ K 



E- 



MVUE 



IIHII 



4Tr 



Kmle 1 1 -2Tr{HK}^ 
{VH}} 



where we appUed Theorem 2 with 



H 



Kmvue — Kmle 

K 



K 



U=l fe=2 

Therefore, in order to prove that 5 < it is sufficient to show that 



Tr{VH} > 0. 



From [17, (5.4)iii]: 



Tr{V,S-} = -iTr{S-}- V{S,T^} 

Tv{v[,]S-} = 4^{sr.f}-V{s-} 



Therefore, 



Tr{VH} 



> 2 (^'^ + 1) 



E (c. + 1) [Tr {S-} + Tr^ {S^}] - ^ + 1) {s^} + Tr^ {s^}] 

fc=l fc=2 

ETr {S-} - Tr {S-} +Tr^ {S^} - Tr^ {s^} 



.fc=2 



> 



(32) 



(33) 



(34) 



(35) 



(36) 



(37) 
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where we have used Ck > Sk ioi k = 2, ■ ■ ■ , K and 

Tr{S,-i}>Tr{S[,i}, k = 2,- 
Tr{S,-^}>Tv{S[-,f}, fc = 2,. 

The last two inequahtes follow from Appendix III. 



,K, 



(38) 
(39) 



B. MVUE is inadmissible 

We now use SURE to prove that the MVUE is inadmissible since its MSE can be improved upon by another 
biased estimator. 

Theorem 4: The estimator 

1 _ 



K K 

Kbe = ^[(n-Cfc-l)S^i]°-^[(n-Sfc-l)S[,| 



fc=i 



fe=2 



TV{S} 



dominates the MVUE in (26) in terms of MSE: 



E 



Krr — K 



< E 



K 



MVUE 



K 



for all K in the set K defined in (8). 
Proof: The difference in MSEs is 











5 = E^ 


Kbe-K 




Kmvue — K 



= El- 



= El- 



2Tr I Kmvue I 
Tr{S} 

2Tr|KK,vuE} 



P 



TV{S} 



E 



P 



El- 



4Tr|V: 

< 



Tr^ {S} 

P 

Tr^ {S} 

1 



2Tr 



Tr{S} 



IK 



Tr 



Kmvue \ ( i 
— -— ^ +4Tr<^ V 



TV{S} 



Tr2{S} ' ' Tr{S} 

3p 



Tr^ {S} 

where we applied Theorem 2 with H = tj^XsjI and used the identity 



(40) 



(41) 



(42) 



(43) 



Theorem 4 proves the inadmissibiUty of the MLE and MVUE in any decomposable graphical model. This 
contribution extends the results in [39], [36], [40]. The specific form of Kbe is not of significant importance and 
was chosen for simplicity. It is based on a similar Efron-Morris type estimator derived for saturated models in [18]. 

Finally, it is worth mentioning that Theorem 4 is an example of the well known Stein's phenomenon in which 
the simultaneous estimation of multiple unrelated parameters can be more accurate than estimating them separately. 
Indeed, the simplest case of decomposable models is the diagonal (or block diagonal) inverse covariance matrix 
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in which Sfc are statistically independent of each other and depend on different parameters. Theorem 4 establishes 
that joint estimation can be improved by global shrinkage. 



C. SURE based parameter tuning 

The main application of SURE in signal processing is parameter tuning [41], [33], |42J. Thus, we now illustrate 
how automatic parameter tuning in decomposable graphical models can utilize SURE. 

Consider a class of estimators parameterized by one or more parameters. For simplicity, we restrict ourselves to 
a special class of estimators with one design parameter: 



K K 

= 5^[(n-Cfc-l-d)S^i]°-^[(n-Sfc-l-d)S[,[ 



(44) 



fc=l k=2 

parameterized by d. Given this class of estimators, we would like to find the parameter d which minimizes the 
MSE: 

2^ 



mini? • 

d 



|||Kd-K||'|, 



or excluding constant terms 



mini? 

d 



2Tr KdK 



(45) 



(46) 



Solving (46) is difficult as the expectation and the second term in the objective depend on K which is unknown. 
Instead, we propose to use the SURE result in Theorem 2 and replace the unknown MSE with its unbiased estimate: 



mm 

d 



2Tr KdKMvuE + 2VKd 



}■ 



Substitution of from (44) and excluding constant terms yields 



where 



Finally, solving for d results in 

-2Tr{VD} 



mind^ ||D|p +4dTr{VD} 

d 



fe=i 



fe=2 



(47) 



(48) 



(49) 



d 



IIDI 



Ef=r ^ {S.-n - Ef=2 Tr {S-} + Ef=i {8,"^ - Ef=2 Tr^ {sf^l } 








Q-l 





(50) 



where we have used (36). 

Simulation results presented in Section VI shows promising performance gains. While we adopted a particularly 
simple class of estimators in (44) more advanced estimator classes can likely be treated as well. For example, 
state-of-the-art methods for covariance estimation in decomposable graphical models involve the use of Bayesian 
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methods and conjugate priors [34], [35]. These distributions depend on tuning parameters that must be chosen 
beforehand or estimated from the available data. Currently, these parameters are chosen through cross vaUdation, 
or empirical Bayes methods. It would be interesting to examine the use of SURE as an alternative. 

V. Positive part estimators 

In the previous sections we proposed estimators which dominate the MLE in terms of MSE. The conditions for 
their existence are similar to those of the MLE, however they may require more samples in order to be positive 
semidefinite. For small sample size, we propose to project these estimators onto the set of decomposable positive 
semi-definite matrices /C in (8). We prove that this projection results in legitimate positive semidefinite estimators 
with better or equal MSE performance. 

Let K be a given estimator of K. Define K as the projection of K onto the set K in (8) 



K = arg min 

KG/C 



K-K 



(51) 



The optimization (51) can be expressed as a semidefinite program (SDP). Therefore, the projected estimator K can 
be efficiently computed using standard SDP optimization packages, e.g., [43]. The following theorem states that 
the projected estimator reduces the error with probability one. 

Theorem 5: Let K be a given estimator of K e /C and define K as its projection in (51). Then, 



K-K < 



K-K 



(52) 



with probability one for all K in the set /C in (8). 

Proof: The proof is based on the convexity of the set /C in (8) and the classical theorem of projection onto 
convex sets (POCS). POCS states that [44] 



Tr 



{(k-k)"(k-k)} 



<o, 



for every K e /C. Adding and subtracting K in the first parenthesis yields 



|K-K||^ <Tr<| (K-K 



Application of the Cauchy Schwartz inequality results in 



K-K < 



K-K 



(K-K 



K-K 



and therefore 



K-K < 



K-K 



(53) 



(54) 



(55) 



(56) 



Since all of the above inequalities apply to any reahzation of the random matrix K, (52) holds with probability 
one. ■ 
When solving (51) is too computationally expensive, we can relax the constraint set and consider the projection 
onto the semidefinite cone: 

2 



min 

K>-0 



K-K 



(57) 
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Similarly to Theorem 5, the semidefinite cone {K ^ O} is a convex set and the solution to (57) dominates K in 
terms of MSE. Its main advantage is that it satisfies a simple closed form. Let 

K = UDU^ (58) 

be the eigenvalue decomposition of K where U is a unitary matrix and D is a diagonal matrix with the eigenvalues 
[D] ... Then, the projected estimator is equal to 

K+ = UD+U^ (59) 

where D+ is a diagonal elements with the elements [D+] = niax{[D]^. , 0}. Due to its similarity to the positive- 
part shrinkage estimator in James-Stein regression, we refer to (59) as the positive-part estimator. 

VI. Numerical results 

Here we present results of numerical experiments in order to illustrate the performance of the above estimators. A 
standard benchmark used for testing (inverse) covariance estimation and covariance selection is the call center data 
set [35], [21], [27]. Our goal is to demonstrate the estimators precision rather than the model selection accuracy. 
Therefore, we estimate the true call center covariance matrix using fixed decomposable models as proposed and 
discussed in [35]. Next, we artificially generate n independent and identically distributed reahzations of jointly 
Gaussian vectors which follow the true covariance structure. We repeat this procedure 100 times and report the 
average performance over these independent trials. We use the three decomposable graphical models analyzed in 
[35]: 

1) Two coupled cliques: Ci = {1, • • • , 70} and C2 = {61, • • • , 100}. 

2) Banding: an non- stationary autoregressive model with p = 239 and cUques Ck = {j,- " > i + L} for j = 
1) ■ ■ ■ — P with an empirically vaUdated bandwidth of L = 20. 

3) Differential banding: an empirically validated and refined banding model in which the first 58 cliques have 
a bandwidth of i = 14 and in the following cliques the bandwidth is equal to L = 4. 

Throughout the simulations, we compare the performance of three estimators: the MLE in (24), the MVUE in (26), 
and the SURE based estimator in (44) with d given by (50). At each reaUzation, we compute the estimators and 
check their semi-definiteness. When an estimator is not positive semidefinite, we resort to its positive part projection 
defined in (57). In Fig. 1-3 we present the normaUzed MSE defined as ||K — K||^/||K|p as a function of the sample 
size n. 

It is easy to see the significant MSE performance advantage of the MVUE and the SURE based estimators of 
K as compared to the MLE. The gain is most significant when the number of samples is small. In this regime, the 
MLE performs poorly and is actually worse than the zero estimator, i.e., K = which ignores the observations 
altogether, whereas the newly proposed estimators provide reasonable performance. In small sample sizes the MVUE 
and SURE based estimators had to be adjusted using their positive part variants. Simulation results (not reported) 
suggest that the improvement in MSE due to the positive part adjustment is negligible. 
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200 



Fig. 1. Two cliques model; significant MSE improvement with respect to MLE. 



VII. Conclusion and future work 

In this paper, we suggested several alternatives to the MLE for concentration estimation in decomposable graphical 
models. We derived the MVUE and further proposed two biased estimators that have lower MSE than the MLE. 
The suggested estimators have simple closed form solutions and their computational complexity is similar to the 
MLE. In addition, we generalized SURE to decomposable graphical models. 

Throughout this work, we assumed that the graphical model is decomposable and illustrated our results for 
practical signal processing examples, e.g., banded and arrow structured concentration matrices. Moreover, any 
conditional independence graph can be approximated as decomposable using available graph theoretical tools. An 
important challenge for future work is the extension of our results to non-decomposable graphs. 
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Fig. 2. Banding model: significant MSB improvement with respect to MLE. 



Appendix I 
Sparsity in the concentration matrix 

In this section, we prove that for the Gaussian distribution conditional independence is equivalent to sparsity 
in the inverse covariance. In order to simpUfy the notation we define a = [x]-, b = [x]j, and c = [x]y^j ^. The 
conditional covariance of a and b given c is defined as 



p„6|c = E{{a-E{a\c}){b-E{b\c})\c}. 
For the Gaussian distribution this covariance has a well known formula [38, Theorem 10.2] 

p,t\c = E {ab} - E {ac^} [E {cc^}] E {cb} . 
On the other hand, the joint covariance of a, b, and c is defined as 



(60) 



(61) 



aa ac^ ab 

K-^=eI ca cc^ cb y (62) 
ba bc^ bb 

Using the matrix inversion lemma for partitioned matrix [38, page 572], the top right element of K is equal to 



[^]o6 = Qa\cPab\cQb\o 



(63) 
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Fig. 3. Differential banding model: significant MSB improvement with respect to MLE. 



where pab\c is the conditional covariance and 



Qa|c 



Qb\o 



E{aa}-E{ac^}[E{cc^}] ^ E {cb} 



E 



{bb}-E{b[ a ]} 



E 



a c 



E{cb} 



(64) 



(65) 



Therefore, if the conditional covariance is equal to zero then the corresponding element in the concentration matrix 
is also equal to zero: 



Pab\c 



^ [KU = 0. 



(66) 



Appendix II 

MVUE AND SURE IN THE NATURAL EXPONENTIAL FAMILY 

A natural exponential family is defined as 

/(x;0) =A;(x)e^^"-*W. (67) 
Its natural parameter is and x is a complete sufficient statistic. The MVUE for estimating given x is [32] 

^MVUE = -^log(fc(x)). (68) 
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For any continuously differentiable function of x denoted by hi (x), the following SURE identity holds [32], [33] 



E{hi (x).[0]J = £;|/i, (x) 



91og (fc (x)) 



dhj (x) 



(69) 



Plugging in the MVUE yields 



E{hi {^)-[e],} = E{h, (x) 



MVUE 



Tr^'^}}, (70) 



which has an intuitive interpretation: an unbiased estimate of hi (x) [6]^ is obtained by simply replacing [d]- with 
its MVUE and adding a correction term. 

The Wishart distribution in (18) belongs to the natural exponential family with parameter 6 being a p^p^^^ vector 
with the elements in the upper triangular part of K, and variable x being a p^p^^^ vector with the elements in 
the upper triangular part of S with —1 and —2 factors on the diagonal and off-diagonal elements, respectively. 
Therefore, the symmetric matrix version of the MVUE in (68) is 

Kmvue = 2Vlog(fc(S)) 

= 2Vlog|S|''^ 

= (n-p-l)S-i (71) 

where V is a symmetric differential operator 

\W].. = l ^ (72) 
L hj ] i_a_ ^ ^ 

This operator compensates for the factor 2 in the off diagonal elements of the derivatives since 

d 



log|S|=[S-i],.+ [S-i]., = 2[S-i],. (73) 



d[S]ij 

for symmetric matrix S and i j. 

Similarly, the SURE expression in (70) is given by [16], [45], [17] 

i;{Tr{H(S)K}} = £;{Tr{(n-p- 1)H(S)S-^ +2VH(S)}}, (74) 

and plugging in the MVUE yields 

E {Tr {H (S) K}} = E {ti- {h (S) Kmvue + 2VH (S)} } . (75) 

Gaussian graphical models also belong to the natural exponential family with parameter being a vector with 

the non-zero elements in the upper triangular part of K, and variable x being a vector with the corresponding 

elements of S and their correction factors. This holds for any Gaussian graphical model, but is not useful unless 

we can evaluate the function k {■) in (67) and its derivatives. In the case of decomposable models A: (•) has a simple 

closed form. Indeed, plugging (18) into (22) yields 

K _ _ K 

log(fc(S)) = 5^1og|Sfe|''^ -^loglSfcl''^. (76) 
fe=i fe=i 
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Therefore, the MVUE of K is 

Kmvue = 2Vlog(A;(S)) 

K ^ 

= ^[(n-Cfc-l)S^i]°-5][(n-s,-l)Sf4] , (77) 

fe=l fe=2 

and SURE is obtained by modifying the differential operator in (75) to take into account only the non-zero elements 
of K as expressed in (30). 



Appendix III 
Technical inequalities 

For simplicity, we partition the submatrix of the fcth clique as 

A B 



Sfe = 

Proof of (38): Using the partitioned matrix inverse 

-1 

A B 



B^ 



>[fe] 



B^ 



(78) 



(79) 



where A = A - BSJ^Jb"^. Therefore, 

Tr {S,-^} = Tr {A-i} +Tr {Sf,[} +Tr {S;-,[B^A-iBSf,i} > Tr {sf,[} , (80) 

where the last inequahty is due to the positive semidefiniteness of S/j ^ and its Schur complement A ^ 0. 
Proof of (39): Using the partitioned matrix inverse once again, we obtain 

Tr{S,2} = Tr{A'2|^2Tr{A-iBS[-,^B^A-i}+Tr{S[,2} 

+2Tr {S[,fB^A-iBS[,J } + Tr | (S[,|B^A-^BSf4)'| > Tr {sf,^} . (81) 
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